Sparse PCA via Covariance Thresholding
Authors
Yash Deshpande and Andrea Montanari
Abstract
In sparse principal component analysis we are given noisy observations of a low-rank matrix of dimension n × p and seek to reconstruct it under additional sparsity assumptions. In particular, we assume here that each of the principal components v1, . . . , vr has at most s0 non-zero entries. We are particularly interested in the high-dimensional regime wherein p is comparable to, or even much larger than, n. In an influential paper, Johnstone and Lu (2004) introduced a simple algorithm that estimates the support of the principal vectors v1, . . . , vr from the largest entries in the diagonal of the empirical covariance. This method can be shown to identify the correct support with high probability if s0 ≤ K1 √(n/log p), and to fail with high probability if s0 ≥ K2 √(n/log p), for two constants 0 < K1, K2 < ∞. Despite a considerable amount of work over the last ten years, no practical algorithm exists with provably better support recovery guarantees. Here we analyze a covariance thresholding algorithm that was recently proposed by Krauthgamer, Nadler, Vilenchik, et al. (2015). On the basis of numerical simulations (for the rank-one case), these authors conjectured that covariance thresholding correctly recovers the support with high probability for s0 ≤ K√n (assuming n of the same order as p). We prove this conjecture, and in fact establish a more general guarantee that covers higher rank as well as n much smaller than p. Recent lower bounds (Berthet and Rigollet, 2013; Ma and Wigderson, 2015) suggest that no polynomial-time algorithm can do significantly better. The key technical component of our analysis develops new bounds on the norm of kernel random matrices, in regimes that were not considered before. Using these, we also derive sharp bounds for estimating the population covariance, and the principal component (with ℓ2-loss).
©2016 Yash Deshpande and Andrea Montanari.
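To make the two procedures concrete, here is a minimal Python sketch of both diagonal thresholding (in the spirit of Johnstone and Lu) and covariance thresholding. The exact threshold scaling, the use of soft thresholding, and leaving the diagonal untouched are illustrative assumptions, not the paper's precise specification.

```python
import numpy as np

def diagonal_thresholding_support(X, s0):
    """Johnstone-Lu style support estimate: the s0 coordinates with the
    largest diagonal entries of the empirical covariance, i.e. the largest
    sample variances."""
    n, _ = X.shape
    variances = (X ** 2).sum(axis=0) / n
    return np.sort(np.argsort(variances)[-s0:])

def covariance_thresholding(X, tau, r=1):
    """Covariance thresholding sketch: soft-threshold the entries of the
    empirical covariance at level tau / sqrt(n), restore the diagonal, and
    take the top-r eigenvectors of the thresholded matrix. The choice of
    tau and keeping the diagonal are illustrative assumptions."""
    n, _ = X.shape
    S = X.T @ X / n                                    # empirical covariance
    t = tau / np.sqrt(n)
    T = np.sign(S) * np.maximum(np.abs(S) - t, 0.0)    # soft threshold
    np.fill_diagonal(T, np.diag(S))                    # keep the variances
    _, vecs = np.linalg.eigh(T)                        # ascending eigenvalues
    return vecs[:, -r:][:, ::-1]                       # top-r, leading first
```

In a spiked model X = √β·u·vᵀ + Z with v supported on s0 coordinates, diagonal thresholding works only while the inflated variances on the support stand out against the sampling fluctuations of the diagonal, which is what produces the √(n/log p) barrier; covariance thresholding instead exploits the correlations between support coordinates.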
Similar references
Do Semidefinite Relaxations Solve Sparse PCA up to the Information Limit?
Estimating the leading principal components of data, assuming they are sparse, is a central task in modern high-dimensional statistics. Many algorithms were developed for this sparse PCA problem, from simple diagonal thresholding to sophisticated semidefinite programming (SDP) methods. A key theoretical question is under what conditions can such algorithms recover the sparse principal component...
Discussion of large covariance estimation by thresholding principal orthogonal complements
We congratulate the authors on a very interesting contribution, which takes the fundamentally important field of covariance matrix estimation in some important new directions. We agree that now is a good time to be studying asymptotic contexts, where the first K eigenvalues of Σ grow quickly. The asymptotic mode of the sample size tending to infinity, with an exponentially growing dimension can...
High-dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components
Principal component analysis (PCA) is a classical method for dimensionality reduction based on extracting the dominant eigenvectors of the sample covariance matrix. However, PCA is well known to behave poorly in the “large p, small n” setting, in which the problem dimension p is comparable to or larger than the sample size n. This paper studies PCA in this high-dimensional regime, but under the...
Positive-Definite ℓ1-Penalized Estimation of Large Covariance Matrices
The thresholding covariance estimator has nice asymptotic properties for estimating sparse large covariance matrices, but it often has negative eigenvalues when used in real data analysis. To fix this drawback of thresholding estimation, we develop a positive-definite ℓ1-penalized covariance estimator for estimating sparse large covariance matrices. We derive an efficient alternating direction me...
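One plausible formulation of such an estimator (my assumption, not necessarily the authors' exact objective) is to minimize ½‖Θ − S‖²_F plus an ℓ1 penalty on off-diagonal entries, subject to an eigenvalue floor Θ ⪰ εI, solved by a standard ADMM splitting:

```python
import numpy as np

def soft(A, t):
    """Entrywise soft thresholding at level t."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def pd_l1_covariance(S, lam, eps=1e-2, rho=1.0, iters=200):
    """ADMM sketch for: min 0.5*||Theta - S||_F^2 + lam*||offdiag(Theta)||_1
    subject to Theta >= eps*I. The split Theta = Z places the smooth + l1
    terms on Theta and the positive-definiteness constraint on Z."""
    p = S.shape[0]
    Z = np.eye(p)
    U = np.zeros((p, p))
    for _ in range(iters):
        # Theta-update: prox of the penalized quadratic
        M = (S + rho * (Z - U)) / (1.0 + rho)
        Theta = soft(M, lam / (1.0 + rho))
        np.fill_diagonal(Theta, np.diag(M))      # no penalty on the diagonal
        # Z-update: project Theta + U onto {Z : eigenvalues >= eps}
        w, V = np.linalg.eigh(Theta + U)
        Z = (V * np.maximum(w, eps)) @ V.T
        # dual update
        U += Theta - Z
    return Z
```

The returned iterate Z is feasible by construction (it is an eigenvalue projection), which is the point of the positive-definite fix over plain thresholding.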
Adaptive Thresholding for Sparse Covariance Matrix Estimation
In this article we consider estimation of sparse covariance matrices and propose a thresholding procedure that is adaptive to the variability of individual entries. The estimators are fully data-driven and demonstrate excellent performance both theoretically and numerically. It is shown that the estimators adaptively achieve the optimal rate of convergence over a large class of sparse covarianc...
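A sketch of the kind of entry-adaptive thresholding this abstract describes; the specific rule λij = δ·√(θ̂ij·log p / n), with θ̂ij estimating the sampling variability of the (i, j) covariance entry, is an assumption for illustration rather than the authors' exact procedure.

```python
import numpy as np

def adaptive_threshold_cov(X, delta=2.0):
    """Entry-adaptive soft thresholding of the sample covariance: each
    entry gets its own threshold, scaled by an estimate of that entry's
    variability, so noisier entries are cut more aggressively."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    # theta[i, j] estimates Var(X_i * X_j), the variability of S[i, j]
    prods = Xc[:, :, None] * Xc[:, None, :]            # shape (n, p, p)
    theta = ((prods - S) ** 2).mean(axis=0)
    lam = delta * np.sqrt(theta * np.log(p) / n)       # per-entry threshold
    T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)  # soft threshold
    np.fill_diagonal(T, np.diag(S))                    # keep the variances
    return T
```

The contrast with a universal threshold is that entries whose products X_i·X_j fluctuate more receive a larger λij, which is what makes the procedure fully data-driven.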